**Discover your next playlist from another decade**
**Million Song Dataset**
Big Data and Cloud Computing Final Project
Chua | Delos Santos | Singson
Hans Zimmer, an internationally award-winning composer and music producer who has scored over 100 films, describes music as a means to rediscover your humanity, and your connection to humanity. This study attempts to bridge generations of humanity through song recommendations across decades. Using the Million Song Dataset from the Laboratory for the Recognition and Organization of Speech and Audio (LabROSA), we created a content-based recommendation engine, built on similarity metrics such as Cosine and Jaccard similarity, that takes a song of your preference and recommends several songs from another decade of your choice that you might find similar or interesting. From our test users, we achieved a best mean song-similarity rating of 6.1/10, and a mean of 7.25/10 for how much users liked the recommendations.
According to the Global Music Report 2018, the global recorded music market grew by 8.1% in 2017. Patronage of the digital market (e.g., Spotify) accounted for 54% of the global music market, and of the 176 million paid subscribers, 36% were new accounts. These numbers show how big the impact of digitalization is and how it will continue to shape the industry.
The music industry continuously develops new songs and artists. At the same time, preserving songs remains important, since a wide catalogue is maintained, including songs from previous decades. This helps listeners discover songs that are similar to their current preferences. For instance, a 19-year-old who naturally listened to 2000s songs might like Michael Jackson, while a 40-year-old woman who mostly listened to songs back in the 1980s might be interested in Lady Gaga. It is a huge chance to discover songs created outside one's generation and expand one's song choices, something contestants commonly do in singing competitions such as The Voice and American Idol.
In this study, the team focuses on creating a recommender system that suggests songs from a preferred decade that are similar to a chosen song from another decade. Similarity is defined according to features such as tempo and loudness. The team hopes to bridge generations through similar music that is decades apart.
The dataset used in this study is the Million Song Dataset (MSD) from the Laboratory for the Recognition and Organization of Speech and Audio (LabROSA), in collaboration with The Echo Nest. MSD was created for research purposes, primarily music information retrieval. It holds approximately 280 GB of data covering songs from 1920 to 2010. The data is provided in HDF5 format with 55 fields per song, as listed below:
analysis sample rate (float) - sample rate of the audio used
artist 7digitalid (int) - ID from 7digital.com, or -1
artist familiarity (float) - algorithmic estimation
artist hotttnesss (float) - algorithmic estimation
artist id (string) - Echo Nest ID
artist latitude (float) - latitude
artist location (string) - location name
artist longitude (float) - longitude
artist mbid (string) - ID from musicbrainz.org
artist mbtags array (string) - tags from musicbrainz.org
artist mbtags count array (int) - tag counts for musicbrainz tags
artist name (string) - artist name
artist playmeid (int) - ID from playme.com, or -1
artist terms array (string) - Echo Nest tags
artist terms freq array (float) - Echo Nest tag frequencies
artist terms weight array (float) - Echo Nest tag weights
audio md5 (string) - audio hash code
bars confidence array (float) - confidence measure
bars start array (float) - beginning of bars, usually on a beat
beats confidence array (float) - confidence measure
beats start array (float) - result of beat tracking
danceability (float) - algorithmic estimation
duration (float) - in seconds
end of fade in (float) - seconds at the beginning of the song
energy (float) - energy from listener point of view
key (int) - key the song is in
key confidence (float) - confidence measure
loudness (float) - overall loudness in dB
mode (int) - major or minor
mode confidence (float) - confidence measure
release (string) - album name
release 7digitalid (int) - ID from 7digital.com, or -1
sections confidence array (float) - confidence measure
sections start array (float) - largest grouping in a song, e.g. verse
segments confidence array (float) - confidence measure
segments loudness max array (float) - max dB value
segments loudness max time array (float) - time of max dB value, i.e. end of attack
segments loudness start array (float) - dB value at onset
segments pitches 2D array (float) - chroma feature, one value per note
segments start array (float) - musical events, ~ note onsets
segments timbre 2D array (float) - texture features (MFCC+PCA-like)
similar artists array (string) - Echo Nest artist IDs (similarity algorithm unpublished)
song hotttnesss (float) - algorithmic estimation
song id (string) - Echo Nest song ID
start of fade out (float) - time in seconds
tatums confidence array (float) - confidence measure
tatums start array (float) - smallest rhythmic element
tempo (float) - estimated tempo in BPM
time signature (int) - estimate of number of beats per bar, e.g. 4
time signature confidence (float) - confidence measure
title (string) - song title
track id (string) - Echo Nest track ID
track 7digitalid (int) - ID from 7digital.com, or -1
year (int) - song release year from MusicBrainz, or 0
The dataset far exceeds the 8 GB of RAM allocated to each user on the ACCESS Lab supercomputer (Jojie). Spark would normally let us both decrease processing time and load data larger than the available RAM using a Spark cluster. Unfortunately, the Spark cluster was not accessible during the timeframe of this project, so the team was limited to the RAM available to a single user. The team therefore loaded a sample of the dataset: 10%, or 100k songs, with a reduced set of fields per song. The 100k file paths were stored in a list using glob. A Pandas-based read function was created and parallelized using SparkContext.parallelize, which distributes the list of paths and maps each one to the read function. This is essentially a parallel Pandas read using Spark.
Once the data has been loaded, the Pandas dataframe is saved to a Parquet file to avoid reprocessing. This is then loaded as a Spark DataFrame for multicore processing. From here on, the process is straightforward: vectorize the features using the VectorAssembler, create a table for each decade of songs, then compare the song of choice against an entire decade to find the most similar songs. This is expounded in the subsections on building recommender systems using Cosine similarity and Jaccard similarity (Section 6).
For validation, we asked 8 MSDS students, including the team, to rate four recommended songs against a base song of their choice on a scale of 1-10 (10 being the highest). Two of the four recommendations came from the Cosine-similarity model and two from the Jaccard-similarity model. Each user gave two ratings: first, the similarity of the base song to the recommended songs from a different decade, and second, how much they liked each recommended song.
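The reported scores are just means over those ratings; a sketch with made-up numbers (the actual survey responses are not reproduced here):

```python
import pandas as pd

# hypothetical ratings: one row per (user, recommended song) pair
ratings = pd.DataFrame({
    'model': ['cosine', 'cosine', 'jaccard', 'jaccard'] * 2,
    'similarity': [6, 7, 5, 6, 7, 6, 5, 7],
    'liking': [8, 7, 6, 8, 7, 8, 6, 8],
})

# mean similarity and liking per model
summary = ratings.groupby('model')[['similarity', 'liking']].mean()
```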
Loading 1M songs
There are a few strategies by which the team could load all 1 million songs, but each takes a lot of time. Because Spark does not have a native HDF5 reader, the team converted each HDF5 file to a CSV, resulting in 1 million CSVs. One might think we could immediately combine all these CSVs with Linux shell commands, but this fails because of inherent limits in commands like ls when listing 1 million files. Instead, we need to bin the CSVs into groups, combine each bin, then re-combine the per-bin files. In this case we chose bins of 50k songs each. This entire process is long and troublesome, but it is a decent workaround for not having a Spark cluster to work with. For the purpose of demonstrating the recommender system, we stick with the 100k dataset.
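The binning step can be sketched in plain Python; the directory layout, file names, and helper below are hypothetical, not the team's actual pipeline.

```python
import glob
import os

import pandas as pd

def combine_in_bins(csv_dir, out_dir, bin_size=50_000):
    """Combine many per-song CSVs into one file via intermediate bins,
    sidestepping shell argument-list limits. Paths are hypothetical."""
    paths = sorted(glob.glob(os.path.join(csv_dir, '*.csv')))
    os.makedirs(out_dir, exist_ok=True)
    bin_files = []
    for i in range(0, len(paths), bin_size):
        # combine one bin of CSVs into a single intermediate file
        chunk = pd.concat((pd.read_csv(p) for p in paths[i:i + bin_size]),
                          ignore_index=True)
        out = os.path.join(out_dir, 'bin_%03d.csv' % (i // bin_size))
        chunk.to_csv(out, index=False)
        bin_files.append(out)
    # re-combine the per-bin files into the final CSV
    combined = pd.concat((pd.read_csv(p) for p in bin_files),
                         ignore_index=True)
    combined.to_csv(os.path.join(out_dir, 'all_songs.csv'), index=False)
    return combined
```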
We loaded the million song dataset in parallel for Spark processing. With limitations on processing power and time, we decided to store reduced data samples in parquet.
import glob
import h5py
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
sc = SparkContext('local[4]')
session = SparkSession(sc)
#TAKE A RANDOM SAMPLE OF 100K SONGS
np.random.seed(1337)
file_dirs = np.random.choice(glob.glob('/mnt/data/public/millionsong/*/*/*/*.h5'),
                             200000).tolist()
#reader function for the parallelization: returns one row per song with
#the selected attributes as a Pandas dataframe
def reader(path):
    df3 = pd.read_hdf(path, 'musicbrainz/songs')
    df1 = pd.read_hdf(path, 'metadata/songs')
    df2 = pd.read_hdf(path, 'analysis/songs')
    return pd.concat([df1, df2, df3], axis=1)[
        ['danceability', 'duration', 'energy', 'key', 'loudness', 'tempo',
         'time_signature', 'title', 'artist_name', 'song_id', 'year',
         'song_hotttnesss']]
from functools import reduce
#to test whether the parallel read works, swap in a small slice like
#file_dirs[:5]; use file_dirs for all 100k songs when you are sure
rdd_parallel = sc.parallelize(file_dirs).map(reader)
#this is one way, but the result is a Pandas dataframe, not a Spark dataframe
reduced = reduce(lambda x, y: pd.concat([x, y]), rdd_parallel.collect())
reduced.info()
reduced.to_parquet('25ksongs_LT10.parquet.gzip', compression='gzip')
#full run over all sampled paths, saved to parquet for reuse
rdd_parallel = sc.parallelize(file_dirs).map(reader)
#this is one way, but the result is a Pandas dataframe, not a Spark dataframe
reduced = reduce(lambda x, y: pd.concat([x, y]), rdd_parallel.collect())
reduced.info()
reduced.to_parquet('100ksongs_LT10.parquet.gzip', compression='gzip')
For the EDA, the team used a smaller but representative subset of the data to identify key insights about the dataset being explored.
import pandas as pd
import numpy as np
import re
import ipywidgets as widgets
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.offline as pyoff
from IPython.display import display
from IPython.display import IFrame
pyoff.init_notebook_mode(connected = False)
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
reduced=pd.read_parquet('25ksongs_LT10.parquet.gzip')
df = reduced[['artist_hotttnesss', 'artist_latitude','artist_longitude','artist_name','song_hotttnesss',
'song_id', 'title','duration','key','loudness','tempo','year','genre_tags']]
df.to_csv('data.csv')
df = pd.read_csv('data.csv')
While playing with the dataset, the team asked: how often do song titles repeat? Surprisingly, a single title was used by 30 artists! Disappointingly, that title was Intro, which artists commonly use for the instrumental track at the beginning of a CD album. We also show below the song titles that were used at least six times.
df_xy = df.groupby('title')[
['song_id']].count()
#description of duplicate song titles
df_xy.describe()
#song titles with greater than 5 duplicates.
df_xy[df_xy['song_id']>5]
On average, songs have a hotttnesss rating of only 0.36. The team found this strange and investigated further. It turns out that a significant number of songs are marked as 0 or NaN in the dataset. Of the almost 45k songs in this dataframe, approximately 30k have no hotttnesss rating. This can mean either that a song was a complete flop, or that it was not popular enough to be given a rating. The team therefore recomputed the mean while discounting songs with 0 hotttnesss: in reality, the mean is 0.49, with a standard deviation of 0.16.
#mean rating with NaN
df.song_hotttnesss.dropna().mean()
#number of NaN in the hotttnesss field
len(df.song_hotttnesss) - len(df.song_hotttnesss.dropna())
#mean hotness
df_nonan = df.dropna()
df_nonan[df_nonan['song_hotttnesss']>0].song_hotttnesss.mean()
#mean stdev
df_nonan[df_nonan['song_hotttnesss']>0].song_hotttnesss.std()
#Distribution of song hotness
sns.distplot(df_nonan[df_nonan['song_hotttnesss']>0].song_hotttnesss, hist=True, kde=False,
bins = 100,color = 'blue',
hist_kws={'edgecolor':'black'})
Geographically, the majority of the songs originated from North America and Europe. It can also be observed visually that most songs have song_hotttnesss within the 0.4-0.6 range.
df_coordinates = df
df_xy = df
data_xy = [go.Scattergeo(
lon=df_coordinates['artist_longitude'],
lat=df_coordinates['artist_latitude'],
text=df_coordinates['title'].values,
mode='markers',
hoverinfo='text',
marker=dict(
color=df_xy['song_hotttnesss'],
cmax=df_xy['song_hotttnesss'].max(),
colorbar=dict(
title="Song Hotness"
))
)]
layout_xy = go.Layout(
title='Where Did the Songs Originate?',
geo=dict(
scope="world",
showframe=False,
showcoastlines=False,
showland=True,
landcolor="rgb(150, 150, 150)",
countrycolor="rgb(255, 255, 255)",
coastlinecolor="rgb(255, 255, 255)",
)
)
fig_xy = go.Figure(data=data_xy, layout=layout_xy)
pyoff.iplot(fig_xy)
genre_year = df[['genre_tags','year']].dropna()
genre_year['year']= genre_year.year.astype(int)
len(genre_year.genre_tags.unique())
The bar graph presents the 20 most common genres from the 1920s to 2010. The genres classic pop and rock and uk have the greatest number of songs. Notice also that people usually create rock songs and differ only in flavor: classic pop and rock, rock and indie, rock, or alternative rock. This graph shows how influential rock has been throughout the decades.
#Top 20 genres
genre_year.genre_tags.value_counts().nlargest(20).plot(kind='barh',figsize=(10,10))
plt.xlabel('Count of Songs')
plt.ylabel('Genre')
plt.title('Top 20 Genres of All Time')
GENRES2 = ['classic pop and rock', 'uk', 'rock and indie', 'folk', 'british', 'american', 'hip hop mb and dance hall',
'punk', 'german', 'french', 'rock', 'finnish', 'country', 'jazz and blues',
'pop and chart','jazz','alternative rock','production music','dance and electronica', 'soul and reggae']
gen2 = {genre: i for i, genre in enumerate(GENRES2)}
pipeline2 = Pipeline([('cc', CountVectorizer(vocabulary=gen2))])
df2 = genre_year
gen_data2 = dict()
#one CountVectorizer pass per decade, counting genre tags
for start in range(1920, 2020, 10):
    mask = (df2['year'] >= start) & (df2['year'] < start + 10)
    gen_data2[str(start)] = pipeline2.fit_transform(
        df2[mask]['genre_tags'].dropna()).toarray().sum(axis=0)
df3 = pd.DataFrame.from_dict(gen_data2)
df3.index=GENRES2
df3.to_csv('genre_year.csv')
df3= pd.read_csv('genre_year.csv')
df3.columns = ['Genre','1920','1930','1940','1950','1960','1970','1980','1990','2000','2010']
df3 = df3.set_index('Genre')
When segmented by decade, most songs come from the 2000s, and song counts decrease for earlier decades. Artists have consistently created classic pop and rock songs over time, contrary to the common notion that the 2000s were a pop generation.
data_plot = [
go.Scatter(
x=df3.T.index,
y=df3.T[each],
name = each
) for each in df3.T.columns
]
layout = go.Layout(
title = 'What Song Genres Do People Listen Over Time?',
yaxis = dict( title = 'Count of Songs')
)
fig = go.Figure(data=data_plot, layout=layout)
pyoff.iplot(fig)
Among the features artist_hotttnesss, duration, key, loudness, tempo, and year, artist_hotttnesss correlates the most with song_hotttnesss. It is intuitive that a song from a well-known artist usually receives a higher rating than a song from an unknown artist.
#How is Song hotness related to other features
import seaborn as sns
corr = (df[['song_hotttnesss', 'artist_hotttnesss','duration','key','loudness','tempo','year']].dropna()).corr()
sns.heatmap(corr, robust=True,linewidths=0.2,cmap='viridis').set_title('How Do Other Features Correlate to Song Hotness?')
#Hottest Songs of All Time
hottest_songs = df[['song_hotttnesss','title']].dropna()
hottest_songs = hottest_songs.drop_duplicates()
top_songs = hottest_songs.sort_values(by=['song_hotttnesss'], ascending=False)
top_songs = top_songs.set_index('title')
The bar graph below shows the 50 hottest songs of all time. Values of song_hotttnesss range from 0 to 1, and only three songs got a perfect rating of 1: White Room, When A Man Loves A Woman, and Bitter Sweet Symphony.
top_songs.head(50).plot(kind='barh',figsize=(20,20))
plt.xlabel('Song Hotness')
plt.ylabel('Song Title')
plt.title('Hottest Songs of All Time')
#How do people identify hot songs?
avg = df.groupby('title')[['tempo','artist_hotttnesss','loudness','duration']].mean().dropna()
htf = df[['song_hotttnesss','artist_hotttnesss','title']].groupby(
'title')[['song_hotttnesss', 'artist_hotttnesss']].max().dropna()
q1 = htf[(htf['artist_hotttnesss'] > htf['artist_hotttnesss'].mean()) & (
htf['song_hotttnesss'] > htf['song_hotttnesss'].mean())]
q2 = htf[(htf['artist_hotttnesss'] <= htf['artist_hotttnesss'].mean()) & (
htf['song_hotttnesss'] > htf['song_hotttnesss'].mean())]
q3 = htf[(htf['artist_hotttnesss'] <= htf['artist_hotttnesss'].mean()) & (
htf['song_hotttnesss'] <= htf['song_hotttnesss'].mean())]
q4 = htf[(htf['artist_hotttnesss'] > htf['artist_hotttnesss'].mean()) & (
htf['song_hotttnesss'] <= htf['song_hotttnesss'].mean())]
avg1 = avg.loc[q1.index,:].mean()
avg2 = avg.loc[q2.index,:].mean()
avg3 = avg.loc[q3.index,:].mean()
avg4 = avg.loc[q4.index,:].mean()
avgs = np.array([avg1,avg2,avg3,avg4])
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
avgs_norm = scaler.fit_transform(avgs)
The song statistics provide information on what makes a song a hit or a flop. The songs are divided into four classes: Classic Hit Songs, New Hit Releases, F-list Songs, and Bad New Releases. Classic Hit Songs span a wide range of loudness, tempo, and artist_hotttnesss. New Hit Releases follow the Classic Hit Songs' formula, with louder, faster songs sung by newer artists. F-list Songs typically have long duration and low tempo. Bad New Releases, on the other hand, comprise songs from known artists that have long duration, which in turn received low song ratings.
layout_avgs = go.Layout(
title='Song Statistics',
paper_bgcolor='rgb(243, 243, 243)',
plot_bgcolor='rgb(243, 243, 243)',)
data1 = [go.Scatterpolar(r=avgs_norm[0], theta=['Tempo','Artist Hotness','Loudness','Duration'], fill='toself', name='Classic Hit Songs'),
go.Scatterpolar(r=avgs_norm[1], theta=['Tempo','Artist Hotness','Loudness','Duration'], fill='toself', name='New Hit Releases'),
go.Scatterpolar(r=avgs_norm[2], theta=['Tempo','Artist Hotness','Loudness','Duration'], fill='toself', name='F-list Songs'),
go.Scatterpolar(r=avgs_norm[3], theta=['Tempo','Artist Hotness','Loudness','Duration'], fill='toself', name='Bad New Releases')]
fig_grp=go.Figure(data=data1, layout=layout_avgs)
pyoff.iplot(fig_grp)
tempo_yr = df[['tempo','year']]
tempo_yr=tempo_yr.dropna()
fin_tempo_yr= tempo_yr[tempo_yr['year'] != 0.0]
def decade(i):
    if 1920.0 <= i < 1930.0:
        return '1920'
    elif 1930.0 <= i < 1940.0:
        return '1930'
    elif 1940.0 <= i < 1950.0:
        return '1940'
    elif 1950.0 <= i < 1960.0:
        return '1950'
    elif 1960.0 <= i < 1970.0:
        return '1960'
    elif 1970.0 <= i < 1980.0:
        return '1970'
    elif 1980.0 <= i < 1990.0:
        return '1980'
    elif 1990.0 <= i < 2000.0:
        return '1990'
    elif 2000.0 <= i < 2010.0:
        return '2000'
    else:
        return '2010'
fin_tempo_yr['decade'] = fin_tempo_yr['year'].apply(decade)
The graph below illustrates an increasing trend in tempo over time. It can be observed that from the late '90s to the 2000s, the music industry continually evolved, offering a wide range of music not limited to a certain tempo bracket.
from matplotlib import pyplot
import seaborn as sns
fig, ax = pyplot.subplots(figsize=(20,10))
sns.boxplot(fin_tempo_yr['decade'], fin_tempo_yr['tempo'])
ax.set_title('How Does Tempo Change Over Time?')
Here's another illustration of how tempo changes over time.
fig,ax = pyplot.subplots(figsize=(20,10))
sns.regplot(fin_tempo_yr['year'],fin_tempo_yr['tempo'],marker='+',fit_reg=False)
ax.set_title("How does Tempo change over time?")
Model Assumptions and Limitations
Recommending content involves making a prediction about how likely it is that a user will like a recommended item [1]. The type of recommender system used in this notebook is a content-based recommender system. These recommenders rely on features extracted from, or inherently present in, the items you would like to recommend. There are numerous ways to implement content-based recommenders, the simplest of which uses similarity-based metrics, in this case Cosine similarity.
Applying non-Spark functions across a Spark DataFrame requires you to register your function as a UDF (user-defined function). It is also worth remembering that, because Spark is closer to Java than to Python, most functions that involve data structures or return values require an explicit return type, and there are several primitive and non-primitive types to choose from in pyspark.sql.types.
With user-defined functions in mind, we can now compute the Cosine similarity of a particular song against an entire decade of songs with the simple formula below:
\begin{equation} Similarity = Cos(\theta) = \frac{A \cdot B}{||A||\hspace{1mm}||B||} = \frac{\sum_{i=1}^{n} A_iB_i}{\sqrt{\sum_{i=1}^{n}A^2_i}\sqrt{\sum_{i=1}^{n}B^2_i}} \end{equation}
This isn't too hard to implement manually, but an existing function in the SciPy library makes it all the more convenient.
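As a quick sanity check with made-up vectors: SciPy's spatial.distance.cosine returns the cosine distance, so similarity is one minus it.

```python
import numpy as np
from scipy import spatial

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # parallel to a
c = np.array([3.0, -1.5, 0.0])  # orthogonal to a (dot product is 0)

# spatial.distance.cosine gives the distance; similarity = 1 - distance
sim_ab = 1 - spatial.distance.cosine(a, b)  # parallel vectors -> 1.0
sim_ac = 1 - spatial.distance.cosine(a, c)  # orthogonal vectors -> 0.0
```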
Two functions were created: find_song and get_similar. find_song does what its name implies: it returns a Spark DataFrame that can be fed into get_similar. get_similar takes a Spark DataFrame and a decade of your choice from the views in section 5.1, computes the distance between each item of the query and every item in the chosen decade, and returns another Spark DataFrame ordered from most to least similar.
Citations: [1] https://www.offerzen.com/blog/how-to-build-a-content-based-recommender-system-for-your-product
import pandas as pd
reduced = pd.read_parquet('100ksongs_LT10.parquet.gzip')
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import functions as psf
sc = SparkContext('local[4]')
session = SparkSession(sc)
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(reduced)
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
inputCols = ['duration',
'key',
'loudness',
'tempo',
'time_signature'],
outputCol = "feature_col")
spark_df_assem = assembler.transform(spark_df)
spark_df_assem.show(5)
spark_df_assem.createOrReplaceTempView('spark_df')
dec_2k = session.sql('SELECT * FROM spark_df WHERE year LIKE "200%"')
dec_2k10 = session.sql('SELECT * FROM spark_df WHERE year LIKE "201%"')
dec_90s = session.sql('SELECT * FROM spark_df WHERE year LIKE "199%"')
dec_80s = session.sql('SELECT * FROM spark_df WHERE year LIKE "198%"')
dec_70s = session.sql('SELECT * FROM spark_df WHERE year LIKE "197%"')
dec_60s = session.sql('SELECT * FROM spark_df WHERE year LIKE "196%"')
dec_50s = session.sql('SELECT * FROM spark_df WHERE year LIKE "195%"')
dec_40s = session.sql('SELECT * FROM spark_df WHERE year LIKE "194%"')
dec_30s = session.sql('SELECT * FROM spark_df WHERE year LIKE "193%"')
dec_20s = session.sql('SELECT * FROM spark_df WHERE year LIKE "192%"')
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.sql.functions import desc
from scipy import spatial
cossim_udf = udf(lambda x,y: float(1-spatial.distance.cosine(x,y)),returnType=FloatType())
dec_2k10.createOrReplaceTempView('dec_2k10')
dec_2k.createOrReplaceTempView('dec_2k')
dec_90s.createOrReplaceTempView('dec_90s')
dec_80s.createOrReplaceTempView('dec_80s')
dec_70s.createOrReplaceTempView('dec_70s')
dec_60s.createOrReplaceTempView('dec_60s')
dec_50s.createOrReplaceTempView("dec_50s")
dec_40s.createOrReplaceTempView('dec_40s')
dec_30s.createOrReplaceTempView('dec_30s')
dec_20s.createOrReplaceTempView('dec_20s')
def find_song(song_keyword=None, artist_keyword=None):
    #view is spark_df
    to_append_song_keyword = "'%" + str(song_keyword) + "%'"
    to_append_artist_keyword = "'%" + str(artist_keyword) + "%'"
    if song_keyword is None:
        string_pattern = 'SELECT * FROM spark_df WHERE artist_name LIKE '
        final = string_pattern + to_append_artist_keyword
    elif artist_keyword is None:
        string_pattern = 'SELECT * FROM spark_df WHERE title LIKE '
        final = string_pattern + to_append_song_keyword
    else:
        string_pattern = 'SELECT * FROM spark_df WHERE title LIKE '
        combi = ' AND '
        string_pattern2 = 'artist_name LIKE '
        final = (string_pattern + to_append_song_keyword + combi
                 + string_pattern2 + to_append_artist_keyword)
    print(final)
    return session.sql(final)
query = find_song('Gravity','Sara')
query.show()
def get_similar(query, target_decade):
    query.createOrReplaceTempView('q')
    string_pattern = ('SELECT distinct_T1.title as title1, '
                      'distinct_T1.artist_name as artist_name1, '
                      'distinct_T1.feature_col as feature_col1, '
                      'distinct_T2.title, distinct_T2.artist_name, '
                      'distinct_T2.feature_col FROM ')
    string2 = 'q AS distinct_T1 CROSS JOIN '
    string4 = ' AS distinct_T2'
    final = string_pattern + string2 + target_decade + string4
    print(final)
    new_table = session.sql(final)
    new_table = new_table.withColumn('Similarity', cossim_udf('feature_col1', 'feature_col'))
    return new_table.sort(desc('title1'), desc('Similarity'))
#search which decade the song comes from first
get_similar(query,'dec_90s').show(10)
Load reduced dataset with 100k songs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
reduced = pd.read_parquet('100ksongs_LT10.parquet.gzip')
reduced.info()
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import functions as psf
sc = SparkContext('local[4]')
session = SparkSession(sc)
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(reduced)
spark_df.show(5)
spark_df.select([psf.count(psf.when(psf.isnan(c), c)).alias(c) for c in spark_df.columns]).show()
spark_df.select('danceability',
'duration',
'energy',
'key',
'loudness',
'tempo',
'time_signature').summary().show()
Data pre-processing: Data Binning and transformation to categorical types
For duration, loudness, and tempo, we performed data binning, replacing each continuous value with a representative of the interval it belongs to. For each field, we used the standard deviation as the bin size. The interval endpoints were computed by adding and subtracting multiples of the standard deviation from the mean, such that all data points are covered; this approach also retains the shape of the distribution. We used the PySpark Bucketizer to assign each value to its representative interval, which converts the field into a categorical type. As for key and time_signature, they were transformed into categorical values directly, since they are already integer types. All of these fields were made categorical because they will be used as features for the Jaccard similarity index.
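The split computation can be mirrored in plain NumPy as a sketch (the helper name and sample data are assumptions; the actual PySpark version follows below):

```python
import numpy as np

def std_splits(values):
    """Bin edges of width one standard deviation, centered near the mean
    and padded with +/- infinity so every value lands in some bucket."""
    mean, std = np.mean(values), np.std(values)
    rng = np.max(values) - np.min(values)
    n = int(np.ceil(rng / std))
    if n % 2 != 0:  # force an even number of interior edges
        n += 1
    lo = mean - np.round(n / 2) * std
    inner = [float(lo + i * std) for i in range(n)]
    return [-float('inf')] + inner + [float('inf')]
```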
import numpy as np
fields = ['duration', 'loudness', 'tempo']
stats_dict = {}
for c in fields:
    col = psf.col(c)
    stats_dict[c + '_mean'] = np.round(spark_df.select(psf.mean(col).alias('stat')).collect()[0]['stat'], 3)
    stats_dict[c + '_std'] = np.round(spark_df.select(psf.stddev(col).alias('stat')).collect()[0]['stat'], 3)
    stats_dict[c + '_max'] = spark_df.select(psf.max(col).alias('stat')).collect()[0]['stat']
    stats_dict[c + '_min'] = spark_df.select(psf.min(col).alias('stat')).collect()[0]['stat']
    stats_dict[c + '_range'] = np.round(stats_dict[c + '_max'] - stats_dict[c + '_min'], 3)
stats_dict
from pyspark.ml.feature import Bucketizer
splits_dict = {}
for colname in fields:
    _splits = [-float('Inf')]
    meanval = stats_dict[colname + '_mean']
    stdval = stats_dict[colname + '_std']
    n = int(np.ceil(stats_dict[colname + '_range'] / stdval))
    if n % 2 != 0:  # force an even number of bins
        n += 1
    minval = meanval - np.round(n / 2) * stdval
    for i in range(n):
        _splits.append(float(minval + i * stdval))
    _splits.append(float('Inf'))
    splits_dict[colname] = _splits
    bucketizer = Bucketizer(splits=_splits, inputCol=colname,
                            outputCol=colname + "_buckets")
    spark_df = bucketizer.setHandleInvalid("keep").transform(spark_df)
splits_dict
spark_df.show()
spark_df.select('duration', 'duration_buckets', 'loudness', 'loudness_buckets', 'tempo', 'tempo_buckets').show()
import matplotlib.pyplot as plt
bins, counts = spark_df.select("duration").rdd.flatMap(lambda x: x).histogram(100)
plt.hist(bins[:-1], bins=bins, weights=counts);
colname = 'duration_buckets'
keys_df = spark_df.groupBy(colname).count().orderBy('count', ascending=False)
keys_arr = [int(row[colname]) for row in keys_df.collect()]
keys_count = [int(row['count']) for row in keys_df.collect()]
plt.bar(keys_arr, keys_count)
bins, counts = spark_df.select("loudness").rdd.flatMap(lambda x: x).histogram(100)
plt.hist(bins[:-1], bins=bins, weights=counts);
colname = 'loudness_buckets'
keys_df = spark_df.groupBy(colname).count().orderBy('count', ascending=False)
keys_arr = [int(row[colname]) for row in keys_df.collect()]
keys_count = [int(row['count']) for row in keys_df.collect()]
plt.bar(keys_arr, keys_count)
bins, counts = spark_df.select("tempo").rdd.flatMap(lambda x: x).histogram(20)
plt.hist(bins[:-1], bins=bins, weights=counts);
colname = 'tempo_buckets'
keys_df = spark_df.groupBy(colname).count().orderBy('count', ascending=False)
keys_arr = [int(row[colname]) for row in keys_df.collect()]
keys_count = [int(row['count']) for row in keys_df.collect()]
plt.bar(keys_arr, keys_count)
spark_df.select('year').summary().show()
spark_df.groupBy('year').count().orderBy('count', ascending=False).show(20)
spark_df.groupBy('key').count().orderBy('key').show(20)
keys_df = spark_df.groupBy('key').count().orderBy('count', ascending=False)
keys_arr = [int(row['key']) for row in keys_df.collect()]
keys_count = [int(row['count']) for row in keys_df.collect()]
plt.bar(keys_arr, keys_count)
plt.xticks(list(range(12)));
spark_df.groupBy('time_signature').count().orderBy('time_signature').show()
time_signature_df = spark_df.groupBy('time_signature').count().orderBy('count', ascending=False)
keys_arr = [int(row['time_signature']) for row in time_signature_df.collect()]
keys_count = [int(row['count']) for row in time_signature_df.collect()]
plt.bar(keys_arr, keys_count)
plt.xticks(list(range(8)));
In order to compute the Jaccard index similarity, we aggregated the features into an array. With a list/array type, we can use Python's set operations to find the intersection and union of the two songs being compared.
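As a minimal illustration with hypothetical feature arrays, the prefixes added below (e.g. "k_", "ts_") keep values from different columns distinct once the arrays are treated as sets:

```python
# Hypothetical feature arrays for two songs, built the same way as in the
# cells that follow. Without the prefixes, key 4 and time signature 4 would
# collide as set elements.
song_a = ['k_4', 'ts_4', 'd_7', 'l_12', 't_5']
song_b = ['k_4', 'ts_3', 'd_7', 'l_12', 't_6']

a, b = set(song_a), set(song_b)
print(sorted(a & b))  # shared features: ['d_7', 'k_4', 'l_12']
print(len(a | b))     # size of the union: 7
```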
songs_df = spark_df.select(
'song_id',
'title',
'artist_name',
'year',
psf.concat(psf.lit("k_"), psf.col("key")).alias('key'),
psf.concat(psf.lit("ts_"), psf.col("time_signature")).alias('time_signature'),
psf.concat(psf.lit("d_"), psf.col("duration_buckets")).alias('duration'),
psf.concat(psf.lit("l_"), psf.col("loudness_buckets")).alias('loudness'),
psf.concat(psf.lit("t_"), psf.col("tempo_buckets")).alias('tempo')
)
songs_df.show(5)
spark_df_assem = songs_df.withColumn("features",psf.array('key', 'time_signature', 'duration', 'loudness', 'tempo'))
spark_df_assem.select(
'song_id',
'artist_name',
'year',
'features').show(5, truncate=False)
spark_df_assem.createOrReplaceTempView('spark_df')
For our testing, we picked songs from different decades. Later, we will find similar songs for each decade from the 2010s down to the 1920s.
dec_2k10 = session.sql('SELECT * FROM spark_df WHERE year LIKE "201%"')
dec_2k10.select(
'song_id',
'artist_name',
'year',
'features').show(10, truncate=False)
dec_2k10.select(
'song_id',
'artist_name',
'year',
'features').orderBy('year', ascending=False).show(10, truncate=False)
dec_test = session.sql("SELECT * FROM spark_df WHERE artist_name like '%Spice Girls%'")
dec_test.select(
'song_id',
'title',
'artist_name',
'year').orderBy('title').show(truncate=False)
dec_test = session.sql("SELECT * FROM spark_df WHERE artist_name like '%Michael Jackson%'")
dec_test.select(
'song_id',
'title',
'year').orderBy('title').show(truncate=False)
dec_test = session.sql("SELECT * FROM spark_df WHERE artist_name like '%Sara Bareilles%'")
dec_test.select(
'song_id',
'title',
'year').orderBy('title').show(truncate=False)
dec_test = session.sql("SELECT * FROM spark_df WHERE artist_name like '%Elvis Pres%'")
dec_test.select(
'song_id',
'title',
'year').orderBy('title').show(truncate=False)
dec_test = session.sql("SELECT * FROM spark_df WHERE artist_name like '%Doris Day%'")
dec_test.select(
'song_id',
'title',
'year').orderBy('title').show(truncate=False)
dec_test = session.sql("SELECT * FROM spark_df WHERE artist_name like '%Abba%'")
dec_test.select(
'song_id',
'title',
'year').orderBy('title').show(truncate=False)
dec_test = session.sql("SELECT * FROM spark_df WHERE artist_name like '%Kelly Clarkson%'")
dec_test.select(
'song_id',
'title',
'year').orderBy('title').show(truncate=False)
df_test = session.sql('SELECT DISTINCT * FROM spark_df WHERE year LIKE "199%" and song_id == "SOFBZYB12A8C13C29A"')
# df_test = session.sql('SELECT * FROM spark_df WHERE year LIKE "0" and song_id == "SORKKFF12AB0189420"')
df_test.show()
dec_2k10.createOrReplaceTempView('Table1')
df_test.createOrReplaceTempView('Table2')
df_crossjoin = session.sql(
    '''SELECT distinct_T1.title AS title1, distinct_T1.artist_name AS artist_name1,
              distinct_T1.features AS features1,
              distinct_T2.title, distinct_T2.artist_name, distinct_T2.features AS features2
       FROM Table1 AS distinct_T1 CROSS JOIN Table2 AS distinct_T2''')
df_crossjoin.show(5)
from pyspark.sql.types import DoubleType
jaccard_udf = psf.udf(lambda x, y: float(len(set(x).intersection(set(y))) / float(len(set(x)) + len(set(y)) - len(set(x).intersection(set(y))))), DoubleType())
df_result = df_crossjoin.withColumn(
'jaccard_similarity', jaccard_udf('features1', 'features2'))
df_result.select('title1', 'artist_name1', 'title', 'artist_name',
'jaccard_similarity').orderBy('jaccard_similarity', ascending=False).show(20)
For our recommender system using the Jaccard index, we created a function that accepts a song_id and the decade in which to look for similar songs. The Jaccard index is the ratio of the size of the intersection to the size of the union of the two sets. Each set is composed of a song's features: key, time signature, duration, tempo, and loudness.
$$J(A,B) = \dfrac{|A \cap B|}{|A \cup B|} = \dfrac{|A \cap B|}{|A| + |B| - |A \cap B|} $$
If the two songs are identical on the given features, the Jaccard index is 1.0. In general, the Jaccard index takes values between 0 and 1.0, inclusive.
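A plain-Python sketch of the same computation as the jaccard_udf, using two hypothetical feature arrays:

```python
def jaccard(x, y):
    """Jaccard index of two feature lists, mirroring the Spark UDF."""
    inter = len(set(x) & set(y))
    return inter / float(len(set(x)) + len(set(y)) - inter)

song_a = ['k_4', 'ts_4', 'd_7', 'l_12', 't_5']
song_b = ['k_4', 'ts_3', 'd_7', 'l_12', 't_6']
print(jaccard(song_a, song_a))  # identical features -> 1.0
print(jaccard(song_a, song_b))  # 3 shared out of 7 total -> 0.428...
```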
dec_test = session.sql("SELECT * FROM spark_df WHERE artist_name like '%ARTIST%'")
dec_test.select(
'song_id',
'title',
'artist_name',
'year').orderBy('title').show(truncate=False)
#MILLIE : Run this cell
# dec_test = session.sql("SELECT * FROM spark_df WHERE title like '%TITLE%'")
# dec_test.select(
# 'song_id',
# 'title',
# 'artist_name',
# 'year').orderBy('title').show(truncate=False)
from pyspark.sql.types import DoubleType
jaccard_udf = psf.udf(lambda x, y: float(len(set(x).intersection(set(y))) / float(len(set(x)) + len(set(y)) - len(set(x).intersection(set(y))))), DoubleType())
def get_decade_table(decade):
    """Return the songs whose release year falls in the requested decade."""
    decade_prefix = {
        'dec_2k1s': '201', 'dec_2ks': '200', 'dec_90s': '199',
        'dec_80s': '198', 'dec_70s': '197', 'dec_60s': '196',
        'dec_50s': '195', 'dec_40s': '194', 'dec_30s': '193',
        'dec_20s': '192',
    }
    prefix = decade_prefix.get(decade)
    if prefix is None:
        return None
    return session.sql(
        'SELECT * FROM spark_df WHERE year LIKE "{}%"'.format(prefix))
def find_similar_songs(song_id, decade, top_n, truncate=False):
    decade_table = get_decade_table(decade)
    decade_table.createOrReplaceTempView('Table1')
    df_song = session.sql(
        "SELECT * FROM spark_df WHERE song_id == '{}'".format(song_id))
    df_song.createOrReplaceTempView('Table2')
    df_crossjoin = session.sql(
        '''SELECT
               distinct_T1.title AS title,
               distinct_T1.artist_name AS artist_name,
               distinct_T1.features AS features,
               distinct_T1.year AS year,
               distinct_T2.title AS title2,
               distinct_T2.artist_name AS artist_name2,
               distinct_T2.features AS features2
           FROM Table1 AS distinct_T1 CROSS JOIN Table2 AS distinct_T2
        ''')
    df_result = df_crossjoin.withColumn(
        'jaccard_similarity', jaccard_udf('features', 'features2'))
    song_choice = df_song.select('title', 'artist_name', 'year').collect()[0]
    print('Song of choice: ', song_choice['title'])
    print('Artist: ', song_choice['artist_name'])
    print('Year: ', song_choice['year'])
    print('\nYour next playlist includes %d songs from decade %s.' % (top_n, decade))
    # Deduplicate before sorting: distinct() after orderBy() would not
    # preserve the sort order.
    df_result.select('title', 'artist_name', 'year', 'jaccard_similarity').distinct().orderBy(
        'jaccard_similarity', ascending=False).show(top_n, truncate=truncate)
For the sample songs using the 100k dataset, we observed that a song is most similar to songs from adjacent decades. Once the recommendation decade is at least two decades away, similarity starts to decline. This result is expected since tempo and time signature differ significantly across decades. It may still be possible to find very similar songs three or four decades away if we were able to process the full million songs.
For the sample song Stop by the Spice Girls, the most similar songs come from the 2000s and 1980s; similarity declines from the 1970s down to the 1920s.
songchoice_id = 'SOKNLUS12AB0186A2C' # Stop by Spice Girls
find_similar_songs(songchoice_id, 'dec_90s', 20)
songchoice_id = 'SOFBZYB12A8C13C29A' # Stop by Spice Girls
find_similar_songs(songchoice_id, 'dec_2k1s', 20)
find_similar_songs(songchoice_id, 'dec_2ks', 10)
find_similar_songs(songchoice_id, 'dec_90s', 10)
find_similar_songs(songchoice_id, 'dec_80s', 3)
find_similar_songs(songchoice_id, 'dec_70s', 10)
find_similar_songs(songchoice_id, 'dec_60s', 10)
find_similar_songs(songchoice_id, 'dec_50s', 10)
find_similar_songs(songchoice_id, 'dec_40s', 10)
find_similar_songs(songchoice_id, 'dec_30s', 10)
find_similar_songs(songchoice_id, 'dec_20s', 10)
songchoice_id = 'SORMLJU12A8C13EEEE'
find_similar_songs(songchoice_id, 'dec_2k1s', 20)
find_similar_songs(songchoice_id, 'dec_2ks', 10, True)
find_similar_songs(songchoice_id, 'dec_90s', 10)
find_similar_songs(songchoice_id, 'dec_80s', 10)
find_similar_songs(songchoice_id, 'dec_70s', 10)
find_similar_songs(songchoice_id, 'dec_60s', 10)
find_similar_songs(songchoice_id, 'dec_50s', 10)
find_similar_songs(songchoice_id, 'dec_40s', 10)
find_similar_songs(songchoice_id, 'dec_30s', 10)
find_similar_songs(songchoice_id, 'dec_20s', 10, True)
songchoice_id = 'SOLISQK12A8C1416AF'
find_similar_songs(songchoice_id, 'dec_2k1s', 20)
find_similar_songs(songchoice_id, 'dec_2ks', 10)
find_similar_songs(songchoice_id, 'dec_90s', 20)
find_similar_songs(songchoice_id, 'dec_80s', 10)
find_similar_songs(songchoice_id, 'dec_70s', 10)
find_similar_songs(songchoice_id, 'dec_60s', 10)
find_similar_songs(songchoice_id, 'dec_50s', 10)
find_similar_songs(songchoice_id, 'dec_40s', 10)
find_similar_songs(songchoice_id, 'dec_30s', 10)
find_similar_songs(songchoice_id, 'dec_20s', 10)
Based on the results, Gravity, released in 2004, is similar to Adia, released in 1997 by Sarah McLachlan. As confirmed by one of the authors, the two singers are among her preferences because of their voice and genre, which is why she listens to both. I Am Woman by Helen Reddy, released in 1972, sounds louder than Gravity, but the author still liked the song's tempo and key.
songchoice_id = 'SONZPPA12AF72A9E13'
find_similar_songs(songchoice_id, 'dec_2k1s', 20)
find_similar_songs(songchoice_id, 'dec_2ks', 10)
find_similar_songs(songchoice_id, 'dec_90s', 20)
find_similar_songs(songchoice_id, 'dec_80s', 10)
find_similar_songs(songchoice_id, 'dec_70s', 10)
find_similar_songs(songchoice_id, 'dec_60s', 10)
find_similar_songs(songchoice_id, 'dec_50s', 10)
find_similar_songs(songchoice_id, 'dec_40s', 10)
find_similar_songs(songchoice_id, 'dec_30s', 10)
find_similar_songs(songchoice_id, 'dec_20s', 10)
songchoice_id = 'SORXLWS12AB01866F3'
find_similar_songs(songchoice_id, 'dec_2k1s', 20)
find_similar_songs(songchoice_id, 'dec_2ks', 10)
find_similar_songs(songchoice_id, 'dec_90s', 20)
find_similar_songs(songchoice_id, 'dec_80s', 10)
find_similar_songs(songchoice_id, 'dec_70s', 10)
find_similar_songs(songchoice_id, 'dec_60s', 10)
find_similar_songs(songchoice_id, 'dec_50s', 10)
find_similar_songs(songchoice_id, 'dec_40s', 10)
find_similar_songs(songchoice_id, 'dec_30s', 10)
find_similar_songs(songchoice_id, 'dec_20s', 10)
songchoice_id = 'SOBLILW12A8C143D33'
find_similar_songs(songchoice_id, 'dec_2k1s', 20)
find_similar_songs(songchoice_id, 'dec_2ks', 10)
find_similar_songs(songchoice_id, 'dec_90s', 20)
find_similar_songs(songchoice_id, 'dec_80s', 10)
find_similar_songs(songchoice_id, 'dec_70s', 10)
find_similar_songs(songchoice_id, 'dec_60s', 10)
find_similar_songs(songchoice_id, 'dec_50s', 10)
find_similar_songs(songchoice_id, 'dec_40s', 10)
find_similar_songs(songchoice_id, 'dec_30s', 10)
find_similar_songs(songchoice_id, 'dec_20s', 10)
songchoice_id = 'SOUWYEZ12D0219189A'
find_similar_songs(songchoice_id, 'dec_2k1s', 20)
find_similar_songs(songchoice_id, 'dec_2ks', 10)
find_similar_songs(songchoice_id, 'dec_90s', 20)
find_similar_songs(songchoice_id, 'dec_80s', 10)
find_similar_songs(songchoice_id, 'dec_70s', 10)
find_similar_songs(songchoice_id, 'dec_60s', 10)
find_similar_songs(songchoice_id, 'dec_50s', 10)
find_similar_songs(songchoice_id, 'dec_40s', 10)
find_similar_songs(songchoice_id, 'dec_30s', 10)
find_similar_songs(songchoice_id, 'dec_20s', 10)
Results

Insights
In the 100k dataset the team used, about half of the songs had no year label. This isn't strictly an issue, but it does limit the model's ability to recommend songs from a specific decade, and a bias may arise if a large portion of these unlabeled songs belongs to a certain genre.
Interestingly, the recommender using Cosine similarity is more inclined to suggest rock music, possibly reflecting the dataset's bias toward rock, while the recommender using Jaccard similarity is, based on our tests, strongly biased toward softer music. One possible explanation is the way the team binned each feature; a more granular binning process might make the model more accurate.
From the observed trend of song similarity using the Jaccard index, we can use this technique to estimate the decade to which a song probably belongs.
Using the entire one-million-song dataset may improve the model by increasing the available song choices.
Even when the recommended songs' genres are not aligned with the respondents' chosen songs, respondents still find inherent similarities and, on average, still like the recommendations.
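The decade-estimation idea above can be sketched in plain Python: score each decade by the mean Jaccard similarity between the query song and that decade's songs, then pick the best-scoring decade. The feature arrays here are hypothetical, not drawn from the dataset.

```python
def jaccard(x, y):
    inter = len(set(x) & set(y))
    return inter / float(len(set(x)) + len(set(y)) - inter)

query = ['k_4', 'ts_4', 'd_7', 'l_12', 't_5']
decade_songs = {  # hypothetical per-decade feature arrays
    'dec_90s': [['k_4', 'ts_4', 'd_7', 'l_12', 't_5'],
                ['k_4', 'ts_4', 'd_6', 'l_12', 't_5']],
    'dec_50s': [['k_2', 'ts_3', 'd_3', 'l_2', 't_1']],
}
scores = {dec: sum(jaccard(query, s) for s in songs) / len(songs)
          for dec, songs in decade_songs.items()}
best = max(scores, key=scores.get)
print(best)  # dec_90s
```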
Million Song Dataset. LabROSA, Columbia University. Retrieved from https://labrosa.ee.columbia.edu/millionsong/
Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
IFPI. (2018). Global Music Report 2018. Retrieved from https://www.ifpi.org/downloads/GMR2018.pdf
The Role of Music in Human Culture. Thought Economics. Retrieved from https://thoughteconomics.com/the-role-of-music-in-human-culture/
Babiera, J., & Nebres, E. (2018). Who's Hott? Anatomy of the Hottest Artists in the Music Industry. Asian Institute of Management.
Special thanks to our respondents:
Elisa Nebres
Johniel Babiera
Earl Abraham
Aian Rosales
Jude Teves
Josh Hiwatig
AC Arcin
Patricia Manasan
Chichan Soriano
Jon Colipapa
Bingbong Recto
Miguel Valdez